AITopics | duplicate detection

Collaborating Authors

duplicate detection

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

If you are looking for an answer to the question What is Artificial Intelligence? and you only have a minute, then here's the definition the Association for the Advancement of Artificial Intelligence offers on its home page: "the scientific understanding of the mechanisms underlying thought and intelligent behavior and their embodiment in machines."

However, if you are fortunate enough to have more than a minute, then please get ready to embark upon an exciting journey exploring AI (but beware, it could last a lifetime) …

Critical appraisal of artificial intelligence for rare-event recognition: principles and pharmacovigilance case studies

Noren, G. Niklas, Meldau, Eva-Lisa, Ellenius, Johan

arXiv.org Artificial IntelligenceOct-7-2025

Many high-stakes AI applications target low-prevalence events, where apparent accuracy can conceal limited real-world value. Relevant AI models range from expert-defined rules and traditional machine learning to generative LLMs constrained for classification. We outline key considerations for critical appraisal of AI in rare-event recognition, including problem framing and test set design, prevalence-aware statistical evaluation, robustness assessment, and integration into human workflows. In addition, we propose an approach to structured case-level examination (SCLE), to complement statistical performance evaluation, and a comprehensive checklist to guide procurement or development of AI models for rare-event recognition. We instantiate the framework in pharmacovigilance, drawing on three studies: rule-based retrieval of pregnancy-related reports; duplicate detection combining machine learning with probabilistic record linkage; and automated redaction of person names using an LLM. We highlight pitfalls specific to the rare-event setting including optimism from unrealistic class balance and lack of difficult positive controls in test sets - and show how cost-sensitive targets align model performance with operational value. While grounded in pharmacovigilance practice, the principles generalize to domains where positives are scarce and error costs may be asymmetric.

large language model, machine learning, natural language, (23 more...)

arXiv.org Artificial Intelligence

2510.04341

Country: Europe (0.46)

Genre: Research Report (1.00)

Industry:

Information Technology (0.93)
Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (0.48)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Applied AI (1.00)
(4 more...)

Add feedback

MHSNet:An MoE-based Hierarchical Semantic Representation Network for Accurate Duplicate Resume Detection with Large Language Model

Li, Yu, Chen, Zulong, Xu, Wenjian, Wen, Hong, Yu, Yipeng, Yiu, Man Lung, Yin, Yuyu

arXiv.org Artificial IntelligenceSep-8-2025

To maintain the company's talent pool, recruiters need to continuously search for resumes from third-party websites (e.g., LinkedIn, Indeed). However, fetched resumes are often incomplete and inaccurate. To improve the quality of third-party resumes and enrich the company's talent pool, it is essential to conduct duplication detection between the fetched resumes and those already in the company's talent pool. Such duplication detection is challenging due to the semantic complexity, structural heterogeneity, and information incompleteness of resume texts. To this end, we propose MHSNet, an multi-level identity verification framework that fine-tunes BGE-M3 using contrastive learning. With the fine-tuned , Mixture-of-Experts (MoE) generates multi-level sparse and dense representations for resumes, enabling the computation of corresponding multi-level semantic similarities. Moreover, the state-aware Mixture-of-Experts (MoE) is employed in MHSNet to handle diverse incomplete resumes. Experimental results verify the effectiveness of MHSNet

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2508.13676

Country: Asia > China > Zhejiang Province (0.15)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.46)

Add feedback

On Parallel External-Memory Bidirectional Search

Siag, Lior, Shperberg, Shahaf S., Felner, Ariel, Sturtevant, Nathan R.

arXiv.org Artificial IntelligenceDec-31-2024

Parallelization and External Memory (PEM) techniques have significantly enhanced the capabilities of search algorithms when solving large-scale problems. Previous research on PEM has primarily centered on unidirectional algorithms, with only one publication on bidirectional PEM that focuses on the meet-in-the-middle (MM) algorithm. Building upon this foundation, this paper presents a framework that integrates both uni- and bi-directional best-first search algorithms into this framework. We then develop a PEM variant of the state-of-the-art bidirectional heuristic search (BiHS) algorithm BAE* (PEM-BAE*). As previous work on BiHS did not focus on scaling problem sizes, this work enables us to evaluate bidirectional algorithms on hard problems. Empirical evaluation shows that PEM-BAE* outperforms the PEM variants of A* and the MM algorithm, as well as a parallel variant of IDA*. These findings mark a significant milestone, revealing that bidirectional search algorithms clearly outperform unidirectional search algorithms across several domains, even when equipped with state-of-the-art heuristics.

algorithm, artificial intelligence, node, (15 more...)

arXiv.org Artificial Intelligence

doi: 10.3233/FAIA240991

2412.21104

Country:

Asia (0.46)
North America > Canada (0.28)

Genre: Research Report > New Finding (0.46)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)

Add feedback

Parallel Strategies for Best-First Generalized Planning

Fernández-Alburquerque, Alejandro, Segovia-Aguas, Javier

arXiv.org Artificial IntelligenceAug-2-2024

In recent years, there has been renewed interest in closing the performance gap between state-of-the-art planning solvers and generalized planning (GP), a research area of AI that studies the automated synthesis of algorithmic-like solutions capable of solving multiple classical planning instances. One of the current advancements has been the introduction of Best-First Generalized Planning (BFGP), a GP algorithm based on a novel solution space that can be explored with heuristic search, one of the foundations of modern planners. This paper evaluates the application of parallel search techniques to BFGP, another critical component in closing the performance gap. We first discuss why BFGP is well suited for parallelization and some of its differentiating characteristics from classical planners. Then, we propose two simple shared-memory parallel strategies with good scaling with the number of cores.

algorithm, best-first generalized planning, parallel strategy, (13 more...)

arXiv.org Artificial Intelligence

2407.21485

Country: Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)

Add feedback

Duplicate Detection with GenAI

Ormesher, Ian

arXiv.org Artificial IntelligenceJun-17-2024

Customer data is often stored as records in Customer Relations Management systems (CRMs). Data which is manually entered into such systems by one of more users over time leads to data replication, partial duplication or fuzzy duplication. This in turn means that there no longer a single source of truth for customers, contacts, accounts, etc. Downstream business processes become increasing complex and contrived without a unique mapping between a record in a CRM and the target customer. Current methods to detect and de-duplicate records use traditional Natural Language Processing techniques known as Entity Matching. In this paper we show how using the latest advancements in Large Language Models and Generative AI can vastly improve the identification and repair of duplicated records. On common benchmark datasets we find an improvement in the accuracy of data de-duplication rates from 30 percent using NLP techniques to almost 60 percent using our proposed method.

candidate record, distance metric, duplicate detection, (12 more...)

arXiv.org Artificial Intelligence

2406.15483

Country:

Europe > Germany > Saxony > Leipzig (0.05)
Europe > Germany > Bavaria > Upper Bavaria > Munich (0.04)

Genre: Research Report (0.50)

Technology:

Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.73)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)

Add feedback

Combining Embeddings and Domain Knowledge for Job Posting Duplicate Detection

Engelbach, Matthias, Klau, Dennis, Kintz, Maximilien, Ulrich, Alexander

arXiv.org Artificial IntelligenceJun-10-2024

Job descriptions are posted on many online channels, including company websites, job boards or social media platforms. These descriptions are usually published with varying text for the same job, due to the requirements of each platform or to target different audiences. However, for the purpose of automated recruitment and assistance of people working with these texts, it is helpful to aggregate job postings across platforms and thus detect duplicate descriptions that refer to the same job. In this work, we propose an approach for detecting duplicates in job descriptions. We show that combining overlap-based character similarity with text embedding and keyword matching methods lead to convincing results. In particular, we show that although no approach individually achieves satisfying performance, a combination of string comparison, deep textual embeddings, and the use of curated weighted lookup lists for specific skills leads to a significant boost in overall performance. A tool based on our approach is being used in production and feedback from real-life use confirms our evaluation.

detection, duplicate, job description, (16 more...)

arXiv.org Artificial Intelligence

2406.06257

Country:

Europe > Germany > Baden-Württemberg > Stuttgart Region > Stuttgart (0.05)
North America > United States > New York > New York County > New York City (0.04)
North America > United States > Arkansas > Pulaski County > Little Rock (0.04)
(4 more...)

Genre: Research Report > New Finding (0.46)

Technology:

Information Technology > Data Science > Data Mining (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

MultiSiam: A Multiple Input Siamese Network For Social Media Text Classification And Duplicate Text Detection

Bhoi, Sudhanshu, Markhedkar, Swapnil, Phadke, Shruti, Agrawal, Prashant

arXiv.org Artificial IntelligenceJan-6-2024

Social media accounts post increasingly similar content, creating a chaotic experience across platforms, which makes accessing desired information difficult. These posts can be organized by categorizing and grouping duplicates across social handles and accounts. There can be more than one duplicate of a post, however, a conventional Siamese neural network only considers a pair of inputs for duplicate text detection. In this paper, we first propose a multiple-input Siamese network, MultiSiam. This condensed network is then used to propose another model, SMCD (Social Media Classification and Duplication Model) to perform both duplicate text grouping and categorization. The MultiSiam network, just like the Siamese, can be used in multiple applications by changing the sub-network appropriately.

category, detection, multisiam, (11 more...)

arXiv.org Artificial Intelligence

2401.06783

Country: Asia > India > Maharashtra > Pune (0.04)

Genre: Research Report (0.40)

Industry: Media (0.47)

Technology:

Information Technology > Communications > Social Media (1.00)
Information Technology > Artificial Intelligence > Natural Language (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Add feedback

Applying Machine Learning for Duplicate Detection, Throttling and Prioritization of Equipment Commissioning Audits at Fulfillment Network

Halawa, Farouq, Abdul, Majid, Mohammed, Raashid

arXiv.org Artificial IntelligenceSep-28-2022

VQ (Vendor Qualification) and IOQ (Installation and Operation Qualification) audits are implemented in warehouses to ensure all equipment being turned over in the fulfillment network meets the quality standards. Audit checks are likely to be skipped if there are many checks to be performed in a short time. In addition, exploratory data analysis reveals several instances of similar checks being performed on the same assets and thus, duplicating the effort. In this work, Natural Language Processing and Machine Learning are applied to trim a large checklist dataset for a network of warehouses by identifying similarities and duplicates, and predict the non-critical ones with a high passing rate. The study proposes ML classifiers to identify checks which have a high passing probability of IOQ and VQ and assign priorities to checks to be prioritized when the time is not available to perform all checks. This research proposes using NLP-based BlazingText classifier to throttle the checklists with a high passing rate, which can reduce 10%-37% of the checks and achieve significant cost reduction. The applied algorithm over performs Random Forest and Neural Network classifiers and achieves an area under the curve of 90%. Because of imbalanced data, down-sampling and upweighting have shown a positive impact on the models' accuracy using F1 score, which improve from 8% to 75%. In addition, the proposed duplicate detection process identifies 17% possible redundant checks to be trimmed.

artificial intelligence, machine learning, natural language, (15 more...)

arXiv.org Artificial Intelligence

2209.14409

Genre: Research Report (0.83)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (0.75)

Add feedback

Duplicate Detection as a Service

Opdenplatz, Juliette, Şimşek, Umutcan, Fensel, Dieter

arXiv.org Artificial IntelligenceJul-20-2022

Completeness of a knowledge graph is an important quality dimension and factor on how well an application that makes use of it performs. Completeness can be improved by performing knowledge enrichment. Duplicate detection aims to find identity links between the instances of knowledge graphs and is a fundamental subtask of knowledge enrichment. Current solutions to the problem require expert knowledge of the tool and the knowledge graph they are applied to. Users might not have this expert knowledge. We present our service-based approach to the duplicate detection task that provides an easy-to-use no-code solution that is still competitive with the state-of-the-art and has recently been adopted in an industrial context. The evaluation will be based on several frequently used test scenarios.

artificial intelligence, machine learning, natural language, (17 more...)

arXiv.org Artificial Intelligence

2207.09672

Country:

Europe > Austria > Tyrol > Innsbruck (0.04)
North America > United States > California > Alameda County > Berkeley (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
Europe > France > Occitanie > Hérault > Montpellier (0.04)

Genre: Research Report (0.82)

Technology:

Information Technology > Information Management (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Text Processing (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

Add feedback

A*+BFHS: A Hybrid Heuristic Search Algorithm

Bu, Zhaoxing, Korf, Richard E.

arXiv.org Artificial IntelligenceMar-23-2021

We present a new algorithm A*+BFHS for solving hard problems where A* and IDA* fail due to memory limitations and/or the existence of many short cycles. A*+BFHS is based on A* and breadth-first heuristic search (BFHS). A*+BFHS combines advantages from both algorithms, namely A*'s node ordering, BFHS's memory savings, and both algorithms' duplicate detection. On easy problems, A*+BFHS behaves the same as A*. On hard problems, it is slower than A* but saves a large amount of memory. Compared to BFIDA*, A*+BFHS reduces the search time and/or memory requirement by several times on a variety of planning domains.

bfh, iteration, node, (17 more...)

arXiv.org Artificial Intelligence

2103.12701

Country:

North America > United States > California > Los Angeles County > Los Angeles (0.28)
Asia > Vietnam > Hanoi > Hanoi (0.04)

Genre: Research Report (0.50)

Technology: Information Technology > Artificial Intelligence > Representation & Reasoning > Search (1.00)

Add feedback